Lecture 03
Lecture 2: Review
- We covered:
- data wrangling and types of variable names
- metadata
- project design
- summary statistics
- graphing means with standard error bars
- pipes (%>% or |>) and how to use group_by
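The group_by/summarise workflow from last lecture can be sketched as follows; the data frame `fish` and its columns `lake` and `length_mm` are hypothetical placeholders:

```r
library(dplyr)

# Hypothetical data frame `fish` with columns `lake` and `length_mm`
fish %>%
  group_by(lake) %>%
  summarise(
    mean  = mean(length_mm, na.rm = TRUE),
    sd    = sd(length_mm, na.rm = TRUE),
    se    = sd(length_mm, na.rm = TRUE) / sqrt(n()),  # standard error of the mean
    count = n()
  )
```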
Our last graph
Lecture 3: How to deal with data wrangling
Introduction to probability distributions
- What is a frequency distribution?
- What is a probability distribution?
- Distributions for variables and for statistics
Estimation
- Populations and samples
- Parameters and statistics
We are going to use some real sculpin data!
Lecture 3: Frequency distributions
- Data have been cleaned (lake names and species names standardized)
- Slimy Sculpin - Toolik Lake
Lecture 3: Frequency Distributions
What is a frequency distribution?
- Displays the number of observations falling in certain intervals
- e.g., the number of sculpin per length interval in Toolik Lake
- shown as a table (like below) or as a histogram
# A tibble: 28 × 2
length_bin n
<fct> <int>
1 [11,13] 4
2 (19,21] 1
3 (23,25] 1
4 (27,29] 2
5 (29,31] 2
6 (31,33] 1
7 (33,35] 4
8 (35,37] 3
9 (37,39] 7
10 (39,41] 9
# ℹ 18 more rows
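A frequency table like the one above can be built with `cut()` plus `count()`; here `sculpin` and the column `length_mm`, and the 2-mm break points, are illustrative assumptions:

```r
library(dplyr)

# Hypothetical: bin sculpin lengths (column `length_mm`) into 2-mm intervals
sculpin %>%
  mutate(length_bin = cut(length_mm,
                          breaks = seq(11, 67, by = 2),
                          include.lowest = TRUE)) %>%  # first bin becomes [11,13]
  count(length_bin)
```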
Lecture 3: Frequency Distributions
The alternative is to use a histogram
- the y-axis is the count
- the x-axis is the bin range
- e.g., bins 0-5, 5-10, 10-15, or whatever you choose
- in ggplot the code looks like
dataframe %>%
  ggplot(aes(x = thing_to_count)) +
  geom_histogram(binwidth = increments_to_work_with)
Lecture 3: Frequency Distributions
What happens as sample size changes…
Low sample size: n = 15
High sample size: n = 70
- Frequency distribution takes on “bell-shape”…
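A quick simulation sketch of this effect, assuming an arbitrary normal population (mean 50, sd 12 are made-up values):

```r
set.seed(42)

# Draw samples of different sizes from the same (assumed) normal population
small <- rnorm(15, mean = 50, sd = 12)   # low sample size
large <- rnorm(70, mean = 50, sd = 12)   # high sample size

par(mfrow = c(1, 2))
hist(small, main = "n = 15", xlab = "length (mm)")
hist(large, main = "n = 70", xlab = "length (mm)")  # closer to a bell shape
```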
Lecture 3: Probability distributions
Can we make assumptions about the distribution of a random variable (e.g., weight) in the population?
Probability distribution:
- theoretical frequency distribution in population
Lecture 3: Probability distributions
For a continuous random variable: probability density function (PDF)
PDF: mathematical expression of the probabilities associated with getting certain values of the random variable
Area under the curve = 1
i.e., the probability of a length between 10 and 80 mm = 1
Lecture 3: Probability distributions
Now we could look at a lot of different ranges of lengths:
- probability of the length being larger than the mean
- probability of the length being larger than 70 mm
- probability of the length falling between two numbers
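In R these probabilities come from `pnorm()`; the mean and sd below are illustrative assumptions, not values from the sculpin data:

```r
# Assuming lengths ~ Normal(mean = 50, sd = 12) (illustrative values)
mu <- 50
sigma <- 12

pnorm(mu, mean = mu, sd = sigma, lower.tail = FALSE)  # P(length > mean) = 0.5
pnorm(70, mean = mu, sd = sigma, lower.tail = FALSE)  # P(length > 70 mm)
pnorm(60, mu, sigma) - pnorm(40, mu, sigma)           # P(40 < length < 60)
```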
Lecture 3: Probability distributions
- Usually need to know probability distribution of random variables in statistical analyses
- Many distributions can be defined; some do a reasonable job, especially with continuous variables
- Different distributions for continuous vs. discrete variables (like the faces of a single die)
Lecture 3: Probability distributions
Normal (Gaussian): symmetrical, bell-shaped
- Defined in terms of mean and variance (μ, σ²)
- The standard normal distribution (SND, or z-distribution) has μ = 0, σ² = 1
\[f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y - \mu)^2}{2\sigma^2}}\]
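A sanity check on the formula above: coding the PDF by hand, its integral over the whole line is 1, and it matches R's built-in `dnorm()`:

```r
# The normal PDF from the slide, written out by hand
f <- function(y, mu = 0, sigma = 1) {
  1 / sqrt(2 * pi * sigma^2) * exp(-(y - mu)^2 / (2 * sigma^2))
}

integrate(f, -Inf, Inf)        # area under the curve = 1 (up to numerical error)
all.equal(f(1.5), dnorm(1.5))  # agrees with R's built-in dnorm
```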
Lecture 3: Probability distributions
Lognormal: right-skewed distribution
Logarithm of random variable is normally distributed
Common in biology.
Why would this occur or be common in biology?
Lecture 3: Probability distributions
Binomial (multinomial):
- probability of events that have two outcomes (heads/tails, dead/alive)
- defined in terms of "successes" out of a set number of trials
With a large number of trials: approximately normal distribution
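A minimal sketch with `dbinom()` using a fair coin as the two-outcome event:

```r
# P(k heads in 10 fair coin flips), k = 0..10
dbinom(0:10, size = 10, prob = 0.5)

# With many trials the binomial looks approximately normal:
barplot(dbinom(0:100, size = 100, prob = 0.5),
        names.arg = 0:100, main = "Binomial(100, 0.5)")
```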
Lecture 3: Probability distributions
Poisson: occurrences of (rare) event in time/space
- E.g., the number of:
- Taraxacum officinale (common dandelion) per quadrat
- copepods eaten per minute
- cells in a field of view
- Measures P(y = certain integer value)
- defined in terms of μ or mean
- Right-skewed at small μ
- more symmetrical at higher μ
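The change in shape with μ can be seen directly with `dpois()` (the μ values of 1 and 10 are just illustrative):

```r
# Poisson probabilities P(y = k) for a small vs. a larger mean
dpois(0:5, lambda = 1)    # right-skewed at small mu
dpois(0:20, lambda = 10)  # more symmetrical at higher mu

par(mfrow = c(1, 2))
barplot(dpois(0:10, 1),   names.arg = 0:10, main = expression(mu == 1))
barplot(dpois(0:25, 10),  names.arg = 0:25, main = expression(mu == 10))
```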
Lecture 3: Distributions of test statistics
Also have distributions of test statistics
Test statistics:
- summary values calculated from data used to test hypotheses
- is your result due to chance?
Different test statistics:
- have different, well-defined distributions
- which allows estimation of the probabilities associated with results
Examples:
- z-distribution, student’s t-distribution, χ2-distribution, F-distribution
Lecture 3: Samples and populations
Inferential statistics:
- inference from samples to populations
Statistical population:
- All possible observations of interest
- Normally: populations too large to census
Populations are defined in time + space
Examples of statistical populations from your research area?
Lecture 3: Samples and populations
A key characteristic of a sample is its
- size (n observations; n = sample size)
Characteristics of a population are called parameters
- Parameters: Greek letters
Characteristics of samples are statistical estimates of parameters
- Statistics: Latin letters
Random sampling is crucial for inferring
from sample -> population
from statistics -> parameters
Lecture 3: Parameters and statistics
Two main kinds of summary statistics: center and spread
Center:
- Mean (µ, ȳ): sum of sampled values divided by n
- Mode: the most common value in the dataset
- Median: middle measurement of the data; = mean for normal distributions
\(\mu = \frac{\sum\limits_{i=1}^{n} Y_i}{n}\)
Median for n odd: \(Y_{(n+1)/2}\)
Median for n even: \(\frac{Y_{n/2} + Y_{(n/2)+1}}{2}\)
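These measures of center are one-liners in R; the helper for the mode is hypothetical, since base R has no mode function for data:

```r
lengths <- c(20, 30, 35, 24, 36)  # example fish lengths (mm)

mean(lengths)    # 29
median(lengths)  # 30 (middle value of the sorted data, n odd)

# Base R's mode() does something else; a small helper for the modal value:
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
```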
Lecture 3: Parameters and statistics
Lecture 3: Parameters and statistics
Spread
- Range: difference between the highest and lowest observation
- Variance (σ2, s2): sum of squared differences of observations from mean, divided by n-1
E.g., fish lengths = 20, 30, 35, 24, 36 mm
# A tibble: 1 × 1
mean
<dbl>
1 29
\(s^2 = \sum_{i=1}^{n} \frac{(y_i - \bar{y})^2}{n-1}\)
Lecture 3: Parameters and statistics
Spread
(20 - 29)^2 + (30 - 29)^2 + (35 - 29)^2 + (24 - 29)^2 + (36 - 29)^2 = 81 + 1 + 36 + 25 + 49 = 192
192 / (5 - 1) = 48 mm^2. Problem: weird units!
# A tibble: 1 × 2
mean variance
<dbl> <dbl>
1 29 48
Lecture 3: Parameters and statistics
Spread
- Standard Deviation (σ, s): square root of the variance.
In the same units as the observations
In example: √48 = 6.9 mm
- Coefficient of variation: SD as % of mean.
- Useful for comparing spread in samples with different means
- In example: (6.9/29)*100= 23.8 %
\(s = \sqrt{\sum_{i=1}^{n} \frac{(y_i - \bar{y})^2}{n-1}}\)
\(CV = \frac{s}{\bar{y}} \times 100\%\)
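The worked example on these slides can be reproduced directly with base R's built-in functions:

```r
lengths <- c(20, 30, 35, 24, 36)  # fish lengths (mm)

var(lengths)                       # 48 = sum of squared deviations / (n - 1)
sd(lengths)                        # 6.928... = sqrt(48)
sd(lengths) / mean(lengths) * 100  # CV ~ 23.9 % (~ 23.8 % with sd rounded to 6.9)
```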
Lecture 3: Estimation
Problem:
- don’t know the values of parameters
Goal:
- estimate parameters from empirical data (samples)
3 general methods of parameter estimation:
- Maximum Likelihood Estimation (MLE)
- Ordinary Least Squares (OLS)
- Resampling techniques
MLE is a general method that estimates parameters by maximizing the likelihood of the observed data given the parameter values.
It aims to find the parameter values that make the observed data most probable under the assumed statistical model.
OLS is a specific method for estimating the parameters of a linear regression model.
It minimizes the sum of the squared differences between observed and predicted values.
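A minimal OLS sketch on simulated data (the true intercept and slope of 2 and 3 are made up for the example); `lm()` finds the coefficients that minimize the sum of squared residuals:

```r
set.seed(1)

# Simulated data: true intercept 2, true slope 3, plus noise
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50, sd = 2)

fit <- lm(y ~ x)  # lm() minimizes the sum of squared residuals
coef(fit)         # estimates should be close to 2 and 3

# The same estimates from the closed-form OLS formulas:
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
```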